Middle East Journal of Applied Science & Technology (MEJAST)
Volume 6, Issue 2, Pages 35-45, April-June 2023


P. Sujitha¹* & R. Vanitha²


¹UG Student, ²Assistant Professor, ¹,²Department of CSE, IFET College of Engineering, Villupuram, India.
Corresponding Author (P. Sujitha) Email: sujithaarull4@gmail.com*


DOI: https://doi.org/10.46431/MEJAST.2023.6205


Copyright © 2023 P.Sujitha & R.Vanitha. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which 
permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 


ABSTRACT 


1. Introduction


Financial fraud is a growing problem with far-reaching effects on governments, businesses, and the financial
sector. Credit card transactions have increased in the modern world because of a strong reliance on internet
technology, and credit card fraud has increased with them, both online and offline. As credit card transactions
have become a common form of payment for goods and services, computational approaches to detecting fraud have
received growing attention. A variety of fraud detection tools and software help prevent fraud in industries such
as credit cards, retail, e-commerce, and insurance. Machine learning is one well-known and widely used solution to
the problem of credit card fraud detection.


Absolute certainty about the genuine intention and legality of a request or transaction is unattainable. Fraud
may come from a credit card that has been lost, stolen, or counterfeited. With the rise in online purchasing,
card-not-present fraud, that is, the use of a credit card number in e-commerce transactions without the physical
card, has also become more frequent. The growth of e-banking and online payment environments has led to an
increase in fraud such as credit card fraud (CCF), causing billions of dollars in losses annually. CCF detection
has emerged as one of the key objectives in this era of digital payments. Many business owners believe that a
cashless society is the way of the future, in which conventional payment methods will play a diminishing role in
growing a firm. Applying mathematical algorithms is one of the most efficient ways to look for potential signs of
fraud in the available data. Credit card fraud detection is the process of classifying transactions into two
categories: legitimate transactions and fraudulent transactions. To detect credit card fraud, a number of
techniques have been developed and put into use; Decision Tree, Random Forest, Logistic Regression, and Extreme
Gradient Boosting (XG Boost) are used here for comparative analysis. Datasets
for credit card transactions are scarce, highly imbalanced, and skewed. The most crucial steps in using machine
learning to assess the effectiveness of strategies on skewed credit card fraud data are selecting the best
features (variables) for the models and choosing the right metric. Credit card fraud detection faces a variety of
difficulties, including the fact that the profile of fraudulent behavior is dynamic and that fraudulent
transactions frequently resemble valid ones. The effectiveness of credit card fraud detection is significantly
affected by the choice of variables, sampling strategy, and detection method. Finally, the outcomes of the
classifier evaluation tests are compiled. According to the experimental results, XG Boost has a 99.94% accuracy
rate. However, when the learning curves of all the classifiers are compared, XG Boost appears to overfit, while
Random Forest, Logistic Regression, and Decision Tree underfit. The accuracy of XG Boost is higher than that of
every other algorithm. Hence we conclude that XG Boost (Extreme Gradient Boosting) is the best model for our
system.
2. Related Works


There have been numerous research studies in the area of credit card fraud detection. This section reviews
research papers focused on detecting credit card fraud, with particular emphasis on work that addresses fraud
detection under class imbalance. Credit card fraud is detected using a variety of methods. Following
Y. Abakarim [1], the primary methodologies can be categorised into areas such as machine learning (ML), credit
card fraud detection, ensemble and feature ranking, and user authentication approaches [1],[3]. Each of ML's
branches can handle a variety of learning tasks, and there are several types of ML learning framework.


V. Arora [3] presented the ML technique random forest (RF) as a remedy for credit card fraud. A random forest is
an ensemble of decision trees, and the RF method is used by most studies. Network analysis can be combined with RF
to form a merged model; this approach is known as APATE [1]. Different ML methods, including supervised and
unsupervised methods, are available to researchers. For CCF identification, ML techniques such as logistic
regression (LR), artificial neural networks (ANN), decision trees (DT), support vector machines (SVM), and Naive
Bayes (NB) are frequently employed. To build reliable detection classifiers, researchers can combine these
strategies with ensemble techniques [3]. An artificial neural network is a collection of connected neurons
(nodes). A feed-forward multilayer perceptron is made up of an input layer, an output layer, and one or more
hidden layers. The output layer provides the algorithm's response. To reduce the error, the weights are first
initialised and then adjusted using the training set. As described by H. Abdi [2], these weights are modified
using supervised methods such as backpropagation in a multilayer perceptron [6]. V.N. Dornadula [10] and
I. Benchaji [11] applied regression and the support vector machine (SVM) as linear classification models. With the
SVM technique, we can determine the points from both classes that are closest to the separating line. The
integration of supervised and unsupervised techniques for the classification of credit card fraud is the focus of
their research. K. Kirasic [7], in a study of random forest versus logistic regression, compared RF and LR for
binary classification in heterogeneous settings. Khatri et al. [9] implemented several ML algorithms for credit
card fraud detection: Decision Tree (DT), k-Nearest Neighbor (KNN), Logistic Regression (LR), Random Forest (RF),
and Naive Bayes (NB). To evaluate the ML-based credit card fraud detection models, the researchers used a dataset
generated from European cardholders in 2013. The authors considered sensitivity and precision as the main
performance criteria. The results showed that the KNN algorithm achieved the best results, with a precision of
91.11 and a sensitivity of 81.19.
Rajora et al. [10] conducted a comparative study of ML methods for credit card fraud detection using the
European cardholders dataset. The methods investigated include RF and KNN. The authors considered accuracy and
the area under the curve (AUC) as the main performance criteria. The results demonstrated that the RF algorithm
achieved an accuracy of [...] and an AUC of 0.94. In contrast, KNN obtained an accuracy of 93.2 and an AUC of
0.93. Although these results are promising, this study did not examine the class imbalance issue that exists in
the dataset that was used. Trivedi et al. [11] proposed an effective credit card fraud detection engine using ML
methods. The authors considered numerous supervised ML techniques, including Gradient Boosting (GB) and Random
Forest (RF), and evaluated them on the European cardholders dataset. The performance criteria used to assess the
effectiveness of the proposed approaches include accuracy and precision. The experiments showed that GB attained
an accuracy of 94.01 and a precision of 93.99, while RF achieved an accuracy of 94.00 and a precision of 95.98.


Tanouz et al. [12] presented a credit card fraud detection framework using ML algorithms. The authors used the
European cardholders dataset to assess the performance of the proposed methods and applied an under-sampling
technique to address the class imbalance that exists in the dataset. The ML methods considered in this work
include RF and LR, with accuracy as the main performance metric. The results demonstrated that the RF approach
achieved a fraud detection accuracy of 91.24, whereas the LR method attained an accuracy of 95.16. The authors
also computed the confusion matrix to check whether the proposed methods performed well for both the positive and
negative classes. The results showed that the class imbalance issue in the European cardholders dataset requires
further investigation.

Riffi et al. [13] implemented a credit card fraud detection engine using the Extreme Learning Machine (ELM) and
Multilayer Perceptron (MLP) algorithms. Both the ELM and the MLP are artificial neural networks (ANNs); however,
they differ in internal architecture. The authors used fraud detection accuracy as the main performance metric.
The results demonstrated that the MLP achieved an accuracy of 97.84, whereas the ELM attained an accuracy of
95.46. This work concluded that the MLP outperformed the ELM; however, the ELM is less complex than the MLP.


Randhawa et al. [14] proposed a credit card fraud detection engine using Adaptive Boosting (AdaBoost) and
Majority Voting methods. The authors used the European cardholders dataset and considered AdaBoost in conjunction
with ML methods such as the Support Vector Machine (SVM). In the experiments, accuracy and the Matthews
Correlation Coefficient (MCC) were the main performance criteria. The outcomes demonstrated that AdaBoost-SVM
achieved an accuracy of 99.85 and an MCC of 0.044. Fawaz Khaled Alarfaj [15] proposed credit card fraud detection
using state-of-the-art machine learning and deep learning algorithms. The ML methods studied were Extreme Learning
Method, Decision Tree, Random Forest, Support Vector Machine, Logistic Regression, and XGBoost. The results
demonstrated that the accuracy of XG Boost is 99.92%.
3. Proposed System


In this section, we describe the proposed techniques for detecting fraud in a credit card system, that is, for
finding the fraudulent transactions in a given dataset. The algorithms used to classify the dataset into
fraudulent and non-fraudulent transactions are Logistic Regression, Decision Tree, Random Forest, and Extreme
Gradient Boosting (XG Boost). Principal Component Analysis (PCA) dimensionality reduction is used to protect user
identities. The performance of the techniques is evaluated based on precision, recall, F1-score, support,
accuracy, sensitivity, and specificity, and an ROC curve is plotted to show classification performance. The
results are that Logistic Regression has an accuracy of 99.89%, Decision Tree 99.87%, and Random Forest 99.88%,
while the best result is obtained by XG Boost with an accuracy of 99.94%.
Figure 1. Block Diagram
(i) Dataset Description 


The well-known credit card fraud detection dataset is used in this study. The dataset contains transactions made
by credit card holders. Only 492 out of 284,807 transactions were fraudulent, which makes the dataset highly
imbalanced. Because of the transformation applied to the dataset, all attributes other than "Time" and "Amount"
are anonymised numerical components; for confidentiality purposes, these attributes are labelled V1, V2, ..., V28.
The "Amount" field represents the transaction amount, while the "Time" attribute represents the number of seconds
elapsed between a transaction and the first transaction in the dataset. The dependent variable, the attribute
"Class," has a value of 1 for fraudulent transactions and 0 for legitimate transactions.
Class label | Binary class labels 1 and 0 for fraudulent and non-fraudulent transactions, respectively
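
To make the imbalance concrete, the following minimal sketch computes the fraud rate and the accuracy of a trivial classifier that labels every transaction as legitimate. Only the two counts come from the dataset description; the variable names and the "always legitimate" baseline are illustrative assumptions, not part of the proposed system.

```python
# Counts taken from the dataset description.
total_transactions = 284_807
fraud_transactions = 492

# Share of fraudulent transactions in the dataset.
fraud_rate = fraud_transactions / total_transactions

# A hypothetical "always legitimate" classifier gets every non-fraud case right
# and every fraud case wrong, so its accuracy equals the non-fraud share.
baseline_accuracy = (total_transactions - fraud_transactions) / total_transactions

print(f"Fraud rate: {fraud_rate:.4%}")
print(f"Accuracy of always predicting 'legitimate': {baseline_accuracy:.4%}")
```

Because such a baseline already exceeds 99.8% accuracy without detecting a single fraud, accuracy alone says little on this dataset, which is why measures such as the F1-score and sensitivity matter.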


Applied Machine Learning Techniques 
(a) Logistic Regression 


Logistic regression is a supervised classification method that predicts an outcome with two possible values, such
as zero or one, no or yes, false or true. It returns the probability of a binary dependent variable predicted from
the independent variables of the dataset. While logistic regression and linear regression have many similarities,
logistic regression yields an S-shaped curve as opposed to linear regression's straight line. Based on one or more
predictors (independent variables), logistic regression generates logistic curves that map values to the range
between 0 and 1.


$$y = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
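
The logistic curve described above can be sketched in a few lines. This is a minimal illustration for a single predictor; the coefficient values and the 0.5 decision threshold are assumptions chosen for the example, not values fitted in this study.

```python
import math


def logistic_probability(x, beta_0, beta_1):
    """Return e^(b0 + b1*x) / (1 + e^(b0 + b1*x)) for a single predictor x."""
    z = beta_0 + beta_1 * x
    # Use the algebraically equivalent form that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical coefficients, chosen only for illustration.
beta_0, beta_1 = -4.0, 2.0
for x in (0.0, 2.0, 4.0):
    p = logistic_probability(x, beta_0, beta_1)
    label = 1 if p >= 0.5 else 0  # the 0.5 threshold is an assumption
    print(f"x={x:.1f}  p={p:.3f}  predicted class={label}")
```

The output always lies between 0 and 1 and rises monotonically with the linear term, which is what lets it be read as a probability of the positive class.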


(b) Decision Tree 


A decision tree is an algorithm that uses conditional control statements to predict the final decision with a
tree-like graph or model of decisions and their potential outcomes. It is a technique for approximating
discrete-valued target functions, in which the learned function is represented by a decision tree. Algorithms of
this kind are well known for inductive learning and have been used effectively for a variety of tasks. To label a
new transaction as legitimate or fraudulent, its attribute values are checked against the decision tree, tracing a
path from the root node to the leaf that gives the transaction's output or class label.


$$\mathrm{Entropy}(S) = \sum_{i} -p_i \log_2 p_i$$
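
As a small worked illustration of the entropy measure, the sketch below computes the entropy of a set of class labels and the entropy reduction (information gain) of a candidate split. Information gain is the usual companion to entropy in tree learning, but the paper does not state which split criterion was used, so treat that part as an assumption; the toy labels are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Toy labels (1 = fraud, 0 = legitimate), invented for illustration.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0, 0], [1, 1, 1, 1]  # a split that separates the classes

print("Entropy of parent node:", entropy(parent))
print("Information gain of split:", information_gain(parent, left, right))
```

A node containing a single class has entropy 0, while an evenly mixed binary node has entropy 1, so a split that separates the classes completely recovers the full parent entropy as gain.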
(c) Random Forest
Random Forest is an approach for classification and regression. In a nutshell, it is an ensemble of decision tree
classifiers. Because it corrects the tendency of individual decision trees to overfit their training set, random
forest has an advantage over a single decision tree. To train each individual tree, a subset of the training set
is sampled at random, and as the tree is grown, each node is split on a feature chosen from a random subset of the
feature set. Because each tree in a random forest is trained independently of the others, training can be
parallelised and remains fast even for large datasets with many features and instances.
$$\hat{y} = \frac{1}{K} \sum_{k=1}^{K} f_k(x)$$
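
The way a forest combines its trees, by averaging the outputs of K individual trees, can be sketched as follows. The three hand-written rules stand in for trained decision trees; their thresholds, the feature values, and the 0.5 majority cut-off are all invented for illustration.

```python
def forest_predict(trees, x):
    """Average the outputs f_k(x) of the K trees: (1/K) * sum_k f_k(x)."""
    outputs = [tree(x) for tree in trees]
    return sum(outputs) / len(outputs)


# Three hand-written "trees" standing in for trained decision trees.
# Each returns 1 (fraud) or 0 (legitimate); the thresholds are hypothetical.
trees = [
    lambda x: 1 if x["Amount"] > 1000 else 0,
    lambda x: 1 if x["V1"] < -2.0 else 0,
    lambda x: 1 if x["Amount"] > 500 and x["V1"] < 0 else 0,
]

transaction = {"Amount": 1200.0, "V1": -0.5}
score = forest_predict(trees, transaction)  # fraction of trees voting "fraud"
label = 1 if score >= 0.5 else 0            # majority vote (assumed cut-off)
print(score, label)
```

With 0/1 tree outputs the average is simply the share of trees voting for fraud, so thresholding it at 0.5 is equivalent to a majority vote.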


(d) XG BOOST-Extreme Gradient Boosting 


XG Boost stands for Extreme Gradient Boosting, and it is the proposed model. Boosting proceeds in several rounds:
trees are built sequentially, with information from each tree (its remaining errors) passed on to the next tree in
order to improve the prediction in subsequent rounds. In essence, it is an additive tree model in which new trees
are added to complement the ones that have already been constructed. XG Boost works only with numeric data and
accommodates missing values. It is a distributed gradient boosting toolkit that has been tuned for fast and
scalable machine learning model training. As an ensemble learning technique, it combines the predictions of a
number of weak models to obtain a stronger prediction. Extreme Gradient Boosting is one of the most widely used
machine learning algorithms because it can handle very large datasets and achieves state-of-the-art performance in
many machine learning tasks, including classification and regression. One of its key characteristics is its
handling of missing values, which lets it work with real-world data without requiring extensive pre-processing.
Moreover, XG Boost includes built-in parallel processing, which allows models to be trained on large datasets
quickly.
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}$$


$$\mathrm{obj}(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$
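
The additive structure of boosting, where each new tree is fitted to what the previous trees still get wrong, can be illustrated with a deliberately simplified sketch. It uses one-split regression stumps on a single feature with a squared-error loss and omits the regularisation term, so it shows the idea of summing trees rather than the XG Boost implementation itself; the function names, learning rate, and toy data are all assumptions.

```python
def fit_stump(xs, residuals):
    """Find the one-threshold split on a single feature that minimises squared error."""
    best = None
    # Excluding the largest value guarantees that both sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Build an additive model: the prediction is the scaled sum of all stump outputs."""
    trees = []
    predictions = [0.0] * len(xs)
    for _ in range(n_trees):
        # Each new stump is fitted to the errors left by the stumps built so far.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
    return lambda x: sum(learning_rate * tree(x) for tree in trees)


# Toy one-feature data, invented for illustration (1 = fraud, 0 = legitimate).
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = boost(xs, ys)
print(round(model(2), 3), round(model(5), 3))
```

Each round shrinks the remaining error by the learning rate, so the summed prediction approaches the targets as more stumps are added.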




Figure 2. Structure of XG Boost 
(e) Performance-Evaluation Measures 
(i) Accuracy 


Accuracy measures the overall performance of a classifier as the fraction of results that are correctly
classified. With TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and
false negatives, it can be represented by the following equation:


$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$


(ii) Precision 
Precision is a performance measure defined as the ratio of correctly identified positives to the total number of
identified positives. This can be seen as follows:


$$\mathrm{Precision} = \frac{TP}{TP + FP}$$


(iii) F-Measure/F1-Score 


The F-measure considers both the precision and the recall. It is the harmonic mean of the two, which can be seen
as follows:


$$F = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$


(iv) Recall 


The recall, also referred to as the sensitivity, is the ratio of correctly identified positives to the total
number of actual positives and can be seen as follows:


$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
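
The four measures above can be computed directly from confusion-matrix counts. The sketch below is a minimal illustration; the function name is hypothetical, and the example counts (58 true positives, 26 false positives, 56,840 true negatives, 38 false negatives) are chosen because they reproduce the Random Forest accuracy and F1-score reported in Section 4. Specificity is included because the paper also reports it.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the evaluation measures from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. both classes occur and
    at least one positive prediction was made.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts consistent with the Random Forest results in Section 4.
metrics = classification_metrics(tp=58, fp=26, tn=56840, fn=38)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

The gap between an accuracy near 0.9989 and an F1-score near 0.64 on the same counts shows how strongly the majority class dominates accuracy on imbalanced data.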


4. Experimental Results
(a) Data Visualisation 


In the experiment, 90% of the data was used as the training set while the remaining 10% served as the test set.
The dataset includes 492 frauds out of 284,807 transactions. It contains only numerical input variables, which are
the result of a PCA transformation. Due to confidentiality issues, the original features and further background
information about the data cannot be provided. The feature 'Time' contains the seconds elapsed between each
transaction and the first transaction in the dataset. Figure 3 shows the class distribution of the Credit Card
Fraud dataset into fraudulent and non-fraudulent transactions.
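
With so few fraud cases, a purely random split can leave the test set with very few positives. One common safeguard is a stratified split that preserves the class ratio in both parts. The paper does not state whether stratification was used, so the sketch below is an assumption offered for illustration; the function name, seed, and toy labels are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.1, seed=42):
    """Return (train_indices, test_indices), preserving each class's share."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels with a similar kind of imbalance (20 frauds in 10,000); not the real data.
labels = [1] * 20 + [0] * 9980
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

Splitting each class separately guarantees that the minority class appears in the test set in the same proportion as in the full data.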
Figure 3. Fraud Vs Non Fraud
Figure 4. Distribution of class with time


(b) Accuracy of Machine learning algorithm 


S.No. | Algorithm Name      | Accuracy Score (%) | F1 Score (%)
1.    | Decision Tree       | 99.87              | 61.53
2.    | Random Forest       | 99.88              | 64.4
3.    | Logistic Regression | 99.89              | 64.19
4.    | XG Boost            | 99.94              | 82.28


The data are analysed to understand the behaviour and patterns of the dataset, and features are drawn for further
training and testing. A labelled dataset is used here. Finally, the model is trained on the data using the Extreme
Gradient Boosting algorithm (XGBOOST).


(1) Logistic Regression Result 


In [30]: display_test_results("Logistic Regression", logistic_model)
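
The notebook helper `display_test_results` is not defined in the paper. A minimal sketch of what such a helper might compute is given below, assuming the model exposes a `predict` method. The notebook version takes only the name and the model, so the explicit `X_test` and `y_test` arguments, the returned dictionary, and the stand-in `ThresholdModel` are all assumptions made to keep the example self-contained.

```python
def display_test_results(model_name, model, X_test, y_test):
    """Print confusion-matrix counts and summary scores for a binary classifier.

    Hypothetical reconstruction: the explicit X_test / y_test arguments are an
    assumption, since the notebook helper receives only the name and the model.
    """
    y_pred = list(model.predict(X_test))
    pairs = list(zip(y_test, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Return 0.0 instead of raising when a class is absent.
        return numerator / denominator if denominator else 0.0

    results = {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }
    print(f"----- {model_name} -----")
    print(f"[[{tn} {fp}]\n [{fn} {tp}]]")  # rows: true class, columns: predicted class
    for name, value in results.items():
        print(f"{name}: {value:.6f}")
    return results


class ThresholdModel:
    """Stand-in for a fitted classifier: flags amounts above a fixed threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, rows):
        return [1 if amount > self.threshold else 0 for amount in rows]


# Toy test data, invented for illustration.
X_test = [10.0, 2500.0, 40.0, 3000.0, 15.0, 900.0]
y_test = [0, 1, 0, 0, 0, 1]
results = display_test_results("Threshold rule", ThresholdModel(1000.0), X_test, y_test)
```

Returning the scores as well as printing them makes it straightforward to collect the per-model results into a comparison table.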
(2) Decision Tree Result
In [33]: display_test_results("Decision Tree", decision_tree_model)
(3) Random Forest Result


In [37]: display_test_results("Random Forest", random_forest_model)

[[56840    26]
 [   38    58]]

------------------ classification_report --------------------
              precision    recall  f1-score   support
           0       1.00      1.00      1.00     56866
           1       0.69      0.60      0.64        96
    accuracy                           1.00     56962
   macro avg       0.84      0.80      0.82     56962
weighted avg       1.00      1.00      1.00     56962

---------- More Specific classification_report ----------
Accuracy:- 0.998876
Sensitivity:- 0.604167
Specificity:- 0.999543
F1-Score:- 0.644444

[Confusion matrix heatmap; x-axis: Predicted label]
[ROC curve (area = 0.96); x-axis: False Positive Rate or [1 - True Negative Rate]]
Figure 7.1. Random Forest        Figure 7.2. Random Forest ROC Curve


(4) XG BOOST Result 


XG Boost achieves an accuracy of 99.94% and an F1-score of 82.28%, making it the most effective of the evaluated
methods for credit card fraud detection.
In [41]: display_test_results("XG Boost", xgb_model)
Figure 8.1. XG Boost        Figure 8.2. XG Boost ROC Curve
Out[42]:
   Model Name            Accuracy   F1-score   ROC
3  XG Boost              0.999456   0.822857   0.978537
0  Logistic Regression   0.998982   0.641975   0.977606
2  Random Forest         0.998876   0.644444   0.957597
1  Decision Tree         0.998771   0.615385   0.921750


Figure 9. Comparison of algorithm 
5. Conclusions and Future Recommendations


In this paper, machine learning techniques, namely Logistic Regression, Decision Tree, Random Forest, and XG Boost
classifiers, were used to detect fraud in a credit card system. Sensitivity, specificity, accuracy, and F1-score
are used to evaluate the performance of the proposed system. The experiments show that Logistic Regression has an
accuracy of 99.89%, Decision Tree 99.87%, and Random Forest 99.88%, while the best result is obtained by XG Boost
with an accuracy of 99.94%. However, when the learning curves of all the classifiers are evaluated, we see that
XG Boost overfits, along with Random Forest and Decision Tree. Overall, we conclude that the Extreme Gradient
Boosting algorithm (XG Boost) is the best model for detecting credit card fraud. In future work, fraud details
will be sent to the respective card owners through email or message.